Eu Jin Lok
Kernel post for Speech Accent Archive on Kaggle
2 December 2019
In this notebook we will go into the details of how to explore audio data and converge on an objective, one which will most likely involve some kind of deep learning, because it's awesome. If I do write a blog post about this, I will update this kernel. But for now, it's just the Jupyter notebook as a kernel, and my very first one!
Before we begin, I just wanted to say that my first real heavy involvement in audio was back in March 2018 whilst doing the Audio competition on Kaggle. Thanks to that competition and the awesome community support, I learnt a lot, and so I wanted to contribute back to the community in the same way. So without further ado, let's begin.
import pandas as pd
import os
import math
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd # To play sound in the notebook
import librosa
import librosa.display
os.chdir(".../input")
#os.chdir("C:\\Users\\User\\Documents\\GIT\\Kaggle-Kernel-Speech-Accent-Archive\\")
After loading the libraries and setting our directory path, let's check out the metadata file to see what we're dealing with.
#load the data
df = pd.read_csv("speakers_all.csv", header=0)
# Check the data
print(df.shape, 'is the shape of the dataset')
print('------------------------')
print(df.head())
I noticed that the last 3 columns of the dataset are strangely empty. Let's clean them up and run some more stats.
df.drop(df.columns[9:12],axis = 1, inplace = True)
print(df.columns)
df.describe()
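Dropping by position works here, but it quietly depends on the column order staying the same. Here is a minimal sketch of an alternative that drops only the columns with no values at all. The `toy` frame and its `Unnamed` column names are made up for illustration and are not the real metadata.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the metadata: two real columns and two entirely empty ones.
# (Illustrative values only; not taken from speakers_all.csv.)
toy = pd.DataFrame({
    "age": [24.0, 31.0, 19.0],
    "sex": ["female", "male", "male"],
    "Unnamed: 9": [np.nan, np.nan, np.nan],
    "Unnamed: 10": [np.nan, np.nan, np.nan],
})

# Collect the names of columns in which every value is missing.
empty_cols = [col for col in toy.columns if toy[col].isna().all()]

# Drop them by name rather than by position.
cleaned = toy.drop(columns=empty_cols)
print(empty_cols)
print(cleaned.columns.tolist())
```

On the real data the same two steps would run against `df` instead of `toy`.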
# Very rough plot
df['country'].value_counts().plot(kind='bar')
# OK, so that wasn't a very good idea. Let's try something else...
df['native_language'].value_counts().plot(kind='bar')
# That's lots of categories too! OK, so maybe let's try to visualise this in a different way...
df.groupby("native_language")['age'].describe().sort_values(by=['count'],ascending=False)
# Check country of origin again...
df.groupby("country")['age'].describe().sort_values(by=['count'],ascending=False)
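The bar plots above are hard to read because there are so many categories. One possible way around that is to keep only the most frequent values and fold the rest into a single bucket. This is a rough sketch; `top_n_counts` and the `other` label are hypothetical names I made up for this example, and the toy series is invented.

```python
import pandas as pd

def top_n_counts(series, n=10, other_label="other"):
    """Count values, keep the n most frequent, and sum the rest into one bucket."""
    counts = series.value_counts()
    top = counts.iloc[:n].copy()
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top[other_label] = rest
    return top

# Toy stand-in for df['native_language'] (made-up values).
languages = pd.Series(
    ["english"] * 5 + ["spanish"] * 3 + ["arabic"] * 2 + ["agni", "zulu"]
)
summary = top_n_counts(languages, n=2)
print(summary)
```

On the real data, something like `top_n_counts(df['native_language'], n=15).plot(kind='bar')` should give a much more readable chart.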
There are more native languages than there are countries, which I suppose makes sense, although that is still just a hypothesis. A Sankey-type plot would be interesting here, but let's park it for now as a separate task. Right now, let's continue with our main objective...
# Age summary by sex
df.groupby("sex")['age'].describe()
Hmmm... there must be a typo in the sex column. Let's notify @Rachel Tatman about this observation. But for now, let's continue on.
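A quick way to surface label typos like this is to list every value that falls outside the set we expect. The sketch below assumes the expected labels are `male` and `female`; the helper name `unexpected_labels` and the misspelling in the toy data are invented for illustration.

```python
import pandas as pd

def unexpected_labels(series, expected):
    """Return counts of non-null values that are not in the expected set."""
    # Normalise whitespace and case so only genuine misspellings remain.
    values = series.dropna().astype(str).str.strip().str.lower()
    return values[~values.isin(expected)].value_counts()

# Toy stand-in for df['sex']; the misspelling is made up for this example.
toy_sex = pd.Series(["male", "female", "femle", "male", None])
odd = unexpected_labels(toy_sex, expected={"male", "female"})
print(odd)
```

On the real data this would be `unexpected_labels(df['sex'], {"male", "female"})`.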
# birthplace
df.groupby("birthplace")['age'].describe().sort_values(by=['count'],ascending=False)
Birthplace is a very sparse field, with 1290 unique categories and very few observations in each one. Again, it could be interesting to see how Birthplace and Country relate, using either a network analysis or a Sankey plot. That may shed some light on whether the many observations with a Seoul birthplace all map to the same country, i.e. could they be South Koreans living elsewhere, such as China or the USA? And for the last one...
# file_missing
df.groupby("file_missing?")['age'].describe().sort_values(by=['count'],ascending=False)
32 missing files. What does this actually mean? I read the overview page and there's no mention of this. So, let's go see for ourselves...
# Count the total audio files given
print(len([name for name in os.listdir('recordings') if os.path.isfile(os.path.join('recordings', name))]))
Huh? We have 2 missing audio files. Well, I suppose the one sure way to tell is if we did a join between the metadata and the audio files themselves.
# filename column. This time we just print out the first 10 records.
df.groupby("filename")['age'].describe().sort_values(by=['count'],ascending=False).head(10)
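The join mentioned above could look roughly like the sketch below. It assumes the `filename` column holds the audio file name without its `.mp3` extension; `compare_with_disk` is a hypothetical helper, and the toy inputs are made up.

```python
import pandas as pd

def compare_with_disk(meta, files_on_disk, ext=".mp3"):
    """Outer-join metadata filenames with the audio files found on disk."""
    # Strip the extension so both sides use the same key (an assumption).
    on_disk = pd.DataFrame({
        "filename": [f[: -len(ext)] for f in files_on_disk if f.endswith(ext)]
    })
    # indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
    return meta[["filename"]].drop_duplicates().merge(
        on_disk, on="filename", how="outer", indicator=True
    )

# Toy inputs; on the real data use df and os.listdir('recordings').
toy_meta = pd.DataFrame({"filename": ["afrikaans1", "agni1", "nicaragua"]})
toy_files = ["afrikaans1.mp3", "agni1.mp3", "zulu9.mp3"]

joined = compare_with_disk(toy_meta, toy_files)
no_audio = joined.loc[joined["_merge"] == "left_only", "filename"].tolist()
no_metadata = joined.loc[joined["_merge"] == "right_only", "filename"].tolist()
print(no_audio, no_metadata)
```

Rows marked `left_only` are in the metadata but have no recording, and rows marked `right_only` are recordings with no metadata.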
Wait, there are some files that share the same filename. But on closer inspection, I suspect these filenames also have missing audio files, in which case it is OK. So, let's have a look at the final column (not exactly the final one, but skipping SpeakerID should be excusable, right?)
# The file_missing? column. Again, just print the first 10 records
df.groupby("filename")['file_missing?'].describe().sort_values(by=['count'],ascending=False).head(10)
# pd.crosstab(df['filename'],df['file_missing?']) as an alternative method
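Rather than eyeballing the first 10 rows, the same check can be done in one go. This is a small sketch on made-up rows, and it assumes `file_missing?` is read in as a boolean column.

```python
import pandas as pd

# Toy stand-in for the metadata (made-up rows).
toy = pd.DataFrame({
    "filename": ["afrikaans1", "nicaragua", "nicaragua", "agni1"],
    "file_missing?": [False, True, True, False],
})

# keep=False marks every row whose filename occurs more than once.
dupes = toy[toy.duplicated("filename", keep=False)]

# True only if every duplicated filename is flagged as missing.
all_dupes_missing = bool(dupes["file_missing?"].all())
print(len(dupes), all_dupes_missing)
```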
The filenames that are duplicated all have missing audio files. Perfect! Everything checks out. We can go ahead and read in the audio files, and listen to a few. We'll look at 'afrikaans1' and 'mandarin46' since they're in our peripheral vision.
# Play afrikaans
fname1 = os.path.join('recordings', 'afrikaans1.mp3')
ipd.Audio(fname1)
# Play mandarin46
fname2 = os.path.join('recordings', 'mandarin46.mp3')
ipd.Audio(fname2)
# Let's have a listen to a male voice.
print(df.groupby("filename")['sex'].describe().head(10))
fname3 = os.path.join('recordings', 'agni1.mp3')
ipd.Audio(fname3)
OK, so we've come to a point where we need to make a decision. There are a few objectives worth pursuing off the top of my head: predicting a speaker's Gender, Country, or Birthplace from the audio.
We could build all 3 applications above, starting with the easiest, the gender predictor. The gender predictor will serve as our prototype, and once we've built it we'll expand to Country, followed by Birthplace. I'm not even sure if Birthplace is viable, but let's re-evaluate when we circle back to it. For now, let's run with Gender first. Also note that we don't have to limit ourselves to supervised modelling. There is plenty more we can do.
There's a lot you can do with audio, but we'll look at these at a later stage. In the meantime, the show must go on. So let's stick to our simple objective and run a few more examples of male and female audio files. This time, I want to hear the US Southern accent, because I've always liked that accent and find it fascinating.
print(df[df['birthplace'].str.contains("kentucky",na=False)])
fname4 = os.path.join('recordings', 'english385.mp3')
ipd.Audio(fname4)
fname5 = os.path.join('recordings', 'english462.mp3')
ipd.Audio(fname5)
The male version doesn't have a strong Southern accent, and there's some distortion of the audio at the start. That could pose a problem for our accent predictor by Birthplace, but it's nothing to worry about for Gender. Looking at the data, it seems like there's some potential age correlation here. So let's hear one final one!
fname6 = os.path.join('recordings', 'english381.mp3') # An older male
ipd.Audio(fname6)
OK, so we'll go ahead with our first mini-objective, which is to create a gender predictor, with the ultimate objective being an accent predictor. The next logical step is to analyse the audio files themselves and extract features from them, which we'll do in Part 2 of this series. In the meantime, I'll leave you with the wave plots of the 3 Kentucky accents. Can you tell the difference / similarity?
# Note: librosa.display.waveplot was removed in librosa 0.10; waveshow is its replacement
# Older female
y, sr = librosa.load(fname4)
plt.figure()
plt.subplot(3, 1, 3)
librosa.display.waveshow(y, sr=sr)
plt.title('older female')
# Older male
y, sr = librosa.load(fname6)
plt.figure()
plt.subplot(3, 1, 3)
librosa.display.waveshow(y, sr=sr)
plt.title('older male')
# Younger male
y, sr = librosa.load(fname5)
plt.figure()
plt.subplot(3, 1, 3)
librosa.display.waveshow(y, sr=sr)
plt.title('younger male')
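If the plots are hard to compare by eye, a few summary numbers per recording can help. The sketch below is only an illustration: `waveform_summary` is a hypothetical helper, and a synthetic sine wave stands in for a real recording so that it runs without the audio files.

```python
import numpy as np

def waveform_summary(signal, sample_rate):
    """Duration in seconds, peak amplitude and RMS level of a mono signal."""
    signal = np.asarray(signal, dtype=float)
    return {
        "duration_s": len(signal) / sample_rate,
        "peak": float(np.max(np.abs(signal))),
        "rms": float(np.sqrt(np.mean(signal ** 2))),
    }

# Synthetic 2-second, 220 Hz sine at half amplitude stands in for a recording.
toy_sr = 22050
toy_t = np.arange(2 * toy_sr) / toy_sr
toy_y = 0.5 * np.sin(2 * np.pi * 220 * toy_t)

stats = waveform_summary(toy_y, toy_sr)
print(stats)
```

With the real files, the `y` and `sr` returned by `librosa.load` can be passed in directly, one recording at a time.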